How I Stopped Worrying about the Twitter Archive at the Library of Congress and Learned to Build a Little One for Myself
Abstract
Twitter is among the most common sources of data in social media research, mainly because its convenient APIs make tweets easy to collect. However, most researchers do not have access to the expensive Firehose and Twitter Historical Archive, and they must rely on data collected with free APIs whose representativeness has been questioned. In 2010 the Library of Congress announced an agreement with Twitter to provide researchers access to the whole Twitter Archive. However, the task proved daunting and, at the time of this writing, no researcher has had the opportunity to access such materials. Still, earlier experiences have shown that smaller searchable archives are feasible and, therefore, within reach of academics with relatively few resources. In this paper I describe my efforts to build one such archive, covering the first three years of Twitter (actually from March 2006 to July 2009) and containing 1.48 billion tweets. If you carefully follow my directions you may have your very own little Twitter Historical Archive and you may forget about paying for historical tweets. Please note that to achieve that you should be proficient in some programming language, knowledgeable about Twitter APIs, and have some basic knowledge of ElasticSearch; moreover, you may very well be disappointed by the quality of the contents of the final dataset.

arXiv:1611.08144v1 [cs.CY] 24 Nov 2016

Twitter, light of my life, mire of my drive

Twitter has become the de facto source of data for most social media research¹; and that is not because Twitter is the most popular online social network or because of the high quality of the data it provides², but because it offers a convenient API to collect amounts of data that seem–but rarely are–massive.
Those researching Twitter in the broadest sense of the term tend to worship–and hope to eventually reach–not one but two “holy grails”, namely, Twitter’s Firehose (i.e., the whole stream of public tweets published in real time) and Twitter’s Historical Archive (i.e., the whole set of public tweets since the beginning of the service in 2006). Purportedly, such data could provide extremely valuable insights about our culture and society, in addition to allowing a variety of natural experiments on different kinds of social interactions–see, for instance, [23]. The truth is that both sources of data are readily available, but at high prices³ and, therefore, most researchers content themselves with less shiny–but gratis–materials such as the public streaming API (purportedly a 1% sample of the whole Firehose) and their own collections of tweets, obtained either by filtering the streaming API or by using the search API. Such gratis datasets face two major issues: on the one hand, their representativeness is questioned (e.g., [15, 22]), and on the other hand, they cannot be publicly released according to Twitter’s TOS (Terms of Service)⁴. This means that a huge number of findings in the academic literature cannot be replicated without enormous–and redundant–effort, and they may be perfectly wrong given that the data on which they rely is not really representative of the whole of Twitter. Given this state of affairs, many of us welcomed the agreement between the Library of Congress and Twitter to grant researchers access to the Twitter Archive [17]. However, all that glitters is not gold and the agreement had an important caveat: no substantial portions of the archive could be made available for download [1], which meant that researchers would need to physically access the archive in order to perform any research.
In addition to that, the number of tweets was so massive (170 billion tweets) that the Archive posed a huge technological challenge, and real-time queries were out of the question⁵. Hence, at the time of this writing the Twitter Archive at the Library of Congress ...

¹ If you are a regular at WWW, ICWSM, CHI, CIKM, ACL, HICSS, WSDM, EMNLP or LREC you are painfully aware of that; if not, please refer to [18]. See also [2] for a rationale for preserving Twitter as an important cultural artifact of our civilization.
² For instance, user profiles on Twitter are extremely sketchy and basic when compared to Facebook ones.
³ I do not forget the Twitter Data Grants that allowed a limited number of teams access to substantial amounts of Twitter data [9]; however, I consider them a flash in the pan given that they have not been offered again and, on top of that, only 6 out of 1,300 teams (0.46%) were awarded one [10].
⁴ Certainly you can release lists of tweet ids, but that means that other researchers need to collect the data all over again, and thus it cannot properly be considered data sharing; still, it is the major if not the only approach to Twitter data sharing at this moment–e.g., [12].
⁵ According to [1] a single search could take up to 24 hours to run.
Similar resources
Tweets as Sources in the History of Contemporary Science
What was once the ultimate in the fleeting world of ephemera, the tweet, is now being archived by the most august library in the land. The Library of Congress is archiving about 500 million tweets per day, up from 140 million per day just two years ago. As Twitter processes 58 million tweets per day, this is still only about ten times the number of tweets that continue to be tweeted every day. ...
Concluding Remarks of the Second International Congress on Traditional Medicine & Materia Medica, 4-7 Oct. 2004, Tehran, Iran
Plants are one of the great resources of the world, and as humankind has evolved, socially, spiritually, and economically, we have found, collectively, a myriad uses for the plants around us. One of those uses is for the prevention, treatment, and cure of various disease states. Documented knowledge about such use dates back at least 4000 years, and several of the plant mentioned in the ancie...
Investigating the Structure and Organization of Hospitals in Islamic Civilization (From the middle of the second century to the middle of the eighth century AH)
Muslims learned how to build a hospital using the experiences of physicians from other nations, especially Iranians, by modeling at Jundishapur Hospital, and this way set up many hospitals. In addition to building a variety of hospitals, Muslims created efficient structures and organized them based on bosses, deputies, stewards, supervisors, nurses, and the like, who served in different parts o...
What ain’t mathematics education?!
Abstract: In 1996 at the first Iranian Mathematics Education Conference (IMEC1) that was held in Isfahan. I obliged myself as a mathematics educator, to inform the mathematics community at large by presenting a paper entitled “what is mathematics education?” to pave the way for the establishment of the master program of mathematics education in Iran. Now, after 16 years, we need to reflect on t...
Journal title:
- CoRR
Volume: abs/1611.08144
Publication date: 2016